Project DataPackets before AI prompt serialization #1800

Merged
chubes4 merged 4 commits into main from fix-ai-packet-projection on May 6, 2026

chubes4 commented May 6, 2026

Summary

  • Add DataPacketPromptProjector so AI prompts receive compact packet projections while canonical DataPackets remain unchanged for storage, replay, and downstream engine state.
  • Use the projector from both AIStep and RequestInspector, remove pretty-printed prompt packet JSON, and report canonical vs projected packet byte metrics.
  • Flatten MCP/MGS packets by decoding JSON data.body, preserving useful source/provenance fields, stripping <em> snippet highlights, and omitting duplicate raw MCP metadata.
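
As a rough illustration of the canonical vs projected byte metrics mentioned above, the sketch below compares compact JSON sizes for two packet arrays. The function name `example_packet_byte_metrics()` and the packet shape are assumptions made for this example, not the code in this PR.

```php
<?php
/**
 * Hypothetical helper: report compact-JSON byte sizes for canonical and
 * projected packets. Neither the name nor the return shape comes from the PR.
 */
function example_packet_byte_metrics( array $canonical, array $projected ): array {
	// Compact encoding: no JSON_PRETTY_PRINT, so no indentation bytes are counted.
	$canonical_bytes = strlen( (string) json_encode( $canonical ) );
	$projected_bytes = strlen( (string) json_encode( $projected ) );

	return array(
		'canonical_bytes' => $canonical_bytes,
		'projected_bytes' => $projected_bytes,
		'saved_bytes'     => $canonical_bytes - $projected_bytes,
	);
}

// Assumed packet shape, for illustration only.
$canonical = array(
	array(
		'data' => array(
			'title' => 'Example title',
			'body'  => 'Example body text.',
			'extra' => 'Field the prompt does not need.',
		),
	),
);

// The projection is a separate copy; the canonical array is left untouched.
$projected = $canonical;
unset( $projected[0]['data']['extra'] );

$metrics = example_packet_byte_metrics( $canonical, $projected );
```

Pretty-printing the same array with `JSON_PRETTY_PRINT` only adds whitespace, which is why dropping it reduces prompt size without changing the decoded data.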

Closes #1799.

Tests

  • php tests/ai-packet-projection-smoke.php
  • php tests/ai-request-inspector-smoke.php
  • ./vendor/bin/phpcs inc/Engine/AI/DataPacketPromptProjector.php inc/Core/Steps/AI/AIStep.php inc/Engine/AI/RequestInspector.php inc/Cli/Commands/AICommand.php tests/ai-packet-projection-smoke.php tests/ai-request-inspector-smoke.php tests/Unit/Core/Steps/AI/AIStepTest.php
  • git diff --check

Additional test signal

  • composer test -- --filter AIStepTest ran the full Homeboy/Playground suite instead of honoring the filter: 1,211 tests, with 1 unrelated failure in DataMachine\Tests\Unit\Abilities\ImageGenerationPromptRefinementTest::test_refine_prompt_includes_post_context_when_provided ('' did not contain "Article context:"). The AI step tests passed during that run.

AI assistance

  • AI assistance: Yes
  • Tool(s): OpenCode (GPT-5.5)
  • Used for: Drafted the projection implementation, smoke/unit coverage, and verification commands; Chris remains responsible for review and merge.

homeboy-ci Bot commented May 6, 2026

Homeboy Results — data-machine

Lint

lint — passed

ℹ️ Full options: homeboy docs commands/lint
Deep dive: homeboy lint data-machine --changed-since 081f9fd

Test

test — passed

  • 11 passed

ℹ️ Auto-fix lint issues: homeboy refactor data-machine --from lint --write
ℹ️ Collect coverage: homeboy test data-machine --coverage
ℹ️ Save test baseline: homeboy test data-machine --baseline
ℹ️ Pass args to test runner: homeboy test -- [args]
ℹ️ Full options: homeboy docs commands/test
Deep dive: homeboy test data-machine --changed-since 081f9fd

Audit

audit — passed

  • test_coverage — 273 finding(s)
  • dead_code — 121 finding(s)
  • intra-method-duplication — 58 finding(s)
  • requested_detectors — 28 finding(s)
  • repeated_literal_shape — 13 finding(s)
  • parallel-implementation — 12 finding(s)
  • dead_guard — 10 finding(s)
  • field_patterns — 5 finding(s)
  • Abilities — 4 finding(s)
  • Flow — 4 finding(s)
  • Total: 540 finding(s)

Deep dive: homeboy audit data-machine --changed-since 081f9fd

Tooling versions
  • Homeboy CLI: homeboy 0.157.1+1aee1b82
  • Extension: wordpress from https://github.com/Extra-Chill/homeboy-extensions
  • Extension revision: e75b247
  • Action: Extra-Chill/homeboy-action@v2

chubes4 commented May 6, 2026

Review blocker before merge:

DataPacketPromptProjector::projectMcpPacket() currently reads matching_content through firstString(), but real MGS packets use matching_content as an array of snippets. Because firstString() skips arrays, the projection drops the most useful source text for MGS search results and leaves mostly title/url/date/author/tags. That is too aggressive and likely to cause behavioral regressions or excessive skips.

Please preserve matching_content arrays as cleaned snippet arrays, stripping <em> tags per element, while keeping scalar snippet/excerpt support. Add a fixture/test that matches the real MGS shape where matching_content is an array, and assert the projected packet is still smaller than canonical while retaining the cleaned snippets.
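
A minimal sketch of the requested behavior, assuming snippets arrive either as a string or as an array of strings. The helper names `example_clean_snippet()` and `example_project_snippets()` are hypothetical and are not the methods in DataPacketPromptProjector.

```php
<?php
/**
 * Hypothetical helper: remove <em> highlight tags but keep the inner text.
 */
function example_clean_snippet( string $snippet ): string {
	// Matches <em>, </em>, and <em ...attributes>; \b keeps <embed> untouched.
	$cleaned = preg_replace( '#</?em\b[^>]*>#i', '', $snippet );

	return trim( (string) $cleaned );
}

/**
 * Hypothetical helper: accept a scalar snippet or an array of snippets.
 *
 * @param mixed $value Raw matching_content, snippet, or excerpt value.
 * @return string|array|null Cleaned string, cleaned list, or null if nothing usable.
 */
function example_project_snippets( $value ) {
	if ( is_string( $value ) ) {
		$cleaned = example_clean_snippet( $value );
		return '' === $cleaned ? null : $cleaned;
	}

	if ( is_array( $value ) ) {
		$snippets = array();
		foreach ( $value as $item ) {
			if ( ! is_string( $item ) ) {
				continue; // Skip non-string elements rather than guessing at them.
			}
			$cleaned = example_clean_snippet( $item );
			if ( '' !== $cleaned ) {
				$snippets[] = $cleaned;
			}
		}
		return $snippets ? $snippets : null;
	}

	return null;
}
```

With this shape, an array such as `array( 'The <em>quick</em> fox', 'plain' )` stays an array of two cleaned strings instead of being dropped.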

CI is green and the overall architecture looks right; this should be a small targeted follow-up.

chubes4 commented May 6, 2026

Review blocker addressed in follow-up commit 24c22780.

Changes:

  • Preserves real MGS matching_content arrays as cleaned snippet arrays.
  • Still supports scalar matching_content, snippet, and excerpt values.
  • Updated smoke/unit fixtures to use the real MGS array shape and assert all cleaned snippets remain present while projected JSON is smaller than canonical JSON.

Verification rerun:

  • php tests/ai-packet-projection-smoke.php - 9 assertions, 0 failures
  • php tests/ai-request-inspector-smoke.php - 33 assertions, 0 failures
  • ./vendor/bin/phpcs inc/Engine/AI/DataPacketPromptProjector.php tests/ai-packet-projection-smoke.php tests/Unit/Core/Steps/AI/AIStepTest.php
  • git diff --check

chubes4 commented May 6, 2026

Second blocker after the snippet-array fix:

The projection still re-adds the raw JSON source object through the body fallback.

Current shape:

$source = self::decodeJsonObject( $data['body'] ?? null );
...
'body' => self::firstString( $source, $data, array(), array( 'content', 'body', 'text', 'summary', 'description' ) ),

For real MGS search packets, data.body is the JSON-encoded source object. After decoding, $source usually has no content/body/text/summary/description; it has title, url, date, author, matching_content, etc. firstString() then falls back to wrapper $data['body'] and places the full raw JSON string back into projected_data['body'].

That defeats the main dedupe goal: the prompt still gets the raw source object, plus flattened fields/snippets.

Please change the logic so a successfully decoded JSON wrapper body is not reused as the projected body fallback. Use body/content fields from the decoded source if present, or use wrapper data.body only when decode failed / it is plain text. Add a test asserting projected MCP JSON-body packets do not include the original raw JSON string in projected[0]['data']['body'].
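
One way the requested fallback rule could look, as a sketch only. `example_decode_json_object()` and `example_project_body()` are hypothetical stand-ins for the real decodeJsonObject()/firstString() logic, whose full signatures are not shown here.

```php
<?php
/**
 * Hypothetical helper: decode a JSON object string, or return null.
 *
 * @param mixed $value Candidate wrapper body.
 */
function example_decode_json_object( $value ): ?array {
	if ( ! is_string( $value ) || 0 !== strpos( ltrim( $value ), '{' ) ) {
		return null; // Not a string, or not shaped like a JSON object.
	}

	$decoded = json_decode( $value, true );

	return is_array( $decoded ) ? $decoded : null;
}

/**
 * Hypothetical helper: pick the projected body without re-adding raw JSON.
 */
function example_project_body( array $data ): ?string {
	$wrapper_body = $data['body'] ?? null;
	$source       = example_decode_json_object( $wrapper_body );

	if ( null !== $source ) {
		foreach ( array( 'content', 'body', 'text', 'summary', 'description' ) as $key ) {
			if ( isset( $source[ $key ] ) && is_string( $source[ $key ] ) && '' !== trim( $source[ $key ] ) ) {
				return $source[ $key ];
			}
		}
		// Decode succeeded but no body-like field exists: do NOT reuse the raw JSON.
		return null;
	}

	// Decode failed or the body is plain text: the wrapper body is safe to use.
	return is_string( $wrapper_body ) && '' !== trim( $wrapper_body ) ? $wrapper_body : null;
}
```

The key design point is that the wrapper body is consulted only on the non-JSON path, so a successfully decoded source object can never leak back into the projection as a string.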

chubes4 commented May 6, 2026

Architecture blocker: this currently makes Data Machine aware of Intelligence/MCP packet internals.

I verified ownership:

  • Intelligence/inc/handlers/class-mcp-fetch-handler.php owns the MCP fetch handler and explicitly notes Data Machine core has no MCP.
  • Data Machine core only has generic queue patch support/comments referencing MCP config shapes.

PR #1800 currently adds MCP-specific packet projection to Data Machine:

  • isMcpPacket()
  • projectMcpPacket()
  • direct knowledge of mcp_raw_item, mcp_url, mcp_tool, mcp_provider, and MGS-style matching_content

That crosses the Data Machine/Intelligence boundary.

Recommended revision:

  1. Data Machine adds the generic projection layer and request-inspector metrics.
  2. Data Machine default projection remains source-agnostic: strip local runtime-only fields like file_info.file_path, compact JSON, no JSON_PRETTY_PRINT, and conservative generic fallback.
  3. Data Machine exposes a filter/extension point such as datamachine_ai_project_data_packet or similar.
  4. Intelligence registers the MCP/MGS-specific projection near its MCP fetch handler/source integration layer.
  5. Data Machine tests cover the generic projection and filter contract. MCP/MGS projection tests belong in Intelligence.

This PR should not merge with MCP-specific key knowledge in Data Machine core.
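
A sketch of what the recommended split might look like on the Data Machine side: a source-agnostic default plus a filter hook. The function names are hypothetical, and the location of file_info under the packet's data key is an assumption; only the filter name comes from the recommendation above.

```php
<?php
/**
 * Hypothetical default projection: strip runtime-only fields, nothing source-specific.
 * Assumes file_info lives under $packet['data'], which may not match the real shape.
 */
function example_default_projection( array $packet ): array {
	// PHP arrays are copied by value, so the canonical packet stays unchanged.
	$projected = $packet;

	if ( isset( $projected['data']['file_info'] ) && is_array( $projected['data']['file_info'] ) ) {
		unset( $projected['data']['file_info']['file_path'] );
	}

	return $projected;
}

/**
 * Hypothetical entry point: apply the default, then let integrations refine it.
 */
function example_project_packet( array $packet ): array {
	$projected = example_default_projection( $packet );

	// apply_filters() only exists when WordPress is loaded; guarded so the sketch runs standalone.
	if ( function_exists( 'apply_filters' ) ) {
		$filtered = apply_filters( 'datamachine_ai_project_data_packet', $projected, $packet );
		if ( is_array( $filtered ) ) {
			$projected = $filtered; // Ignore callbacks that return a non-array.
		}
	}

	return $projected;
}
```

Under this arrangement, any MCP/MGS-specific compaction would live in a callback registered by Intelligence, and Data Machine core would not need to know those key names.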

chubes4 commented May 6, 2026

Architecture blocker addressed in follow-up commit 70c37fa4.

Changes:

  • Removed MCP/MGS-specific projection from Data Machine core (mcp_*, matching_content, isMcpPacket(), projectMcpPacket()).
  • Data Machine now provides a source-agnostic default projector that only strips local runtime file_info.file_path.
  • Added datamachine_ai_project_data_packet filter so Intelligence or another source-owning integration can provide source-specific compaction near its handler.
  • Replaced MCP/MGS tests with generic projection and filter-contract coverage.

Verification rerun:

  • php tests/ai-packet-projection-smoke.php - 11 assertions, 0 failures
  • php tests/ai-request-inspector-smoke.php - 33 assertions, 0 failures
  • ./vendor/bin/phpcs inc/Engine/AI/DataPacketPromptProjector.php tests/ai-packet-projection-smoke.php tests/Unit/Core/Steps/AI/AIStepTest.php
  • git diff --check

I also searched the Data Machine projection files/tests for the blocked MCP/MGS key names; no matches remain in the changed projection surface.

chubes4 commented May 6, 2026

Final API hardening request before merge:

Please add a minimal, source-agnostic context array as the third filter argument.

Current:

apply_filters( 'datamachine_ai_project_data_packet', $projected, $packet );

Target:

apply_filters(
    'datamachine_ai_project_data_packet',
    $projected,
    $packet,
    $context
);

Context should stay generic, for example job_id, pipeline_id, flow_id, flow_step_id, and pipeline_step_id where available. Do not pass the whole EngineData object and do not include MCP/Intelligence/source-specific fields.

Please update DataPacketPromptProjector::project(), AIStep, RequestInspector, and the smoke/unit tests so the filter contract is covered. Existing 2-arg filters remain compatible because WordPress filters respect accepted args.
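
To illustrate the compatibility claim, here is a sketch of a 2-argument and a 3-argument callback registered on the same hook. The callback names, the `raw_payload` key, and the flow ID `7` are all hypothetical; `add_filter()` with its priority and accepted-args parameters is standard WordPress.

```php
<?php
/**
 * Hypothetical 2-argument callback: unaware of context, still valid.
 */
function example_two_arg_projection( $projected, $packet ) {
	if ( is_array( $projected ) && isset( $projected['data'] ) && is_array( $projected['data'] ) ) {
		unset( $projected['data']['raw_payload'] ); // Hypothetical verbose field.
	}

	return $projected;
}

/**
 * Hypothetical 3-argument callback: uses the generic context to scope its work.
 */
function example_three_arg_projection( $projected, $packet, $context = array() ) {
	$flow_id = is_array( $context ) ? ( $context['flow_id'] ?? null ) : null;

	if ( 7 !== $flow_id ) {
		return $projected; // Only compact packets for the hypothetical flow 7.
	}

	return example_two_arg_projection( $projected, $packet );
}

// Guarded so the sketch also runs outside WordPress.
if ( function_exists( 'add_filter' ) ) {
	// WordPress passes each callback only as many arguments as it declared here.
	add_filter( 'datamachine_ai_project_data_packet', 'example_two_arg_projection', 10, 2 );
	add_filter( 'datamachine_ai_project_data_packet', 'example_three_arg_projection', 20, 3 );
}
```

Because the fourth argument to `add_filter()` caps how many values a callback receives, adding `$context` to the `apply_filters()` call does not change what the 2-argument callback sees.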

chubes4 commented May 6, 2026

Final API hardening addressed in follow-up commit 59be00fe.

Changes:

  • DataPacketPromptProjector::project() now accepts optional source-agnostic context.
  • datamachine_ai_project_data_packet now receives ($projected, $packet, $context).
  • AIStep and RequestInspector pass minimal runtime IDs only: job_id, pipeline_id, flow_id, flow_step_id, pipeline_step_id.
  • No EngineData object is passed to projection filters.
  • Existing 2-arg filters remain compatible through WordPress accepted_args behavior.
  • Smoke/unit coverage now asserts a 3-arg filter receives context and that a 2-arg filter still works.
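
The "minimal runtime IDs only" rule could be enforced with an allowlist, sketched below. `example_build_projection_context()` and the shape of the `$runtime` input are assumptions for illustration; the five key names are the ones listed above.

```php
<?php
/**
 * Hypothetical helper: reduce arbitrary runtime state to the generic context keys.
 */
function example_build_projection_context( array $runtime ): array {
	$allowed = array( 'job_id', 'pipeline_id', 'flow_id', 'flow_step_id', 'pipeline_step_id' );

	// Keep only allowlisted keys; objects or source-specific fields never pass through.
	$context = array_intersect_key( $runtime, array_flip( $allowed ) );

	// "Where available": drop IDs that are present but null.
	return array_filter(
		$context,
		static function ( $value ) {
			return null !== $value;
		}
	);
}

// Assumed runtime input, including fields that must not reach filters.
$context = example_build_projection_context(
	array(
		'job_id'      => 12,
		'pipeline_id' => 3,
		'flow_id'     => null,
		'engine_data' => array( 'anything' => 'internal' ),
		'mcp_url'     => 'https://example.com/mcp',
	)
);
```

An allowlist fails closed: a new runtime field added later stays out of the filter context until someone deliberately adds its key.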

Verification rerun:

  • php tests/ai-packet-projection-smoke.php - 12 assertions, 0 failures
  • php tests/ai-request-inspector-smoke.php - 33 assertions, 0 failures
  • ./vendor/bin/phpcs inc/Engine/AI/DataPacketPromptProjector.php inc/Core/Steps/AI/AIStep.php inc/Engine/AI/RequestInspector.php tests/ai-packet-projection-smoke.php tests/Unit/Core/Steps/AI/AIStepTest.php
  • git diff --check

chubes4 merged commit 98885d2 into main on May 6, 2026
3 checks passed
chubes4 deleted the fix-ai-packet-projection branch on May 6, 2026 at 15:02